ABSTRACT

The purpose of this essay is to describe the 
principles of educational measurement proposed by B. Wood during the 
1920s in his dissertation, written under the direction of E. L. 
Thorndike, and later published as "Measurement in Higher Education" 
(1923). These principles were selected because they illustrate one of
the earliest and most complete descriptions of a set of basic and 
perennial problems encountered in educational testing. The specific 
questions addressed in this essay are concerned with the following: 
(1) the basic measurement problems identified by Thorndike and Wood 
in the first two decades of this century; (2) the means by which 
these measurement problems appear within the context of educational
testing according to Wood; (3) means by which these problems were 
addressed by Wood in the 1920s; and (4) contemporary views of these 
problems. Principles of educational measurement (objectivity, defined 
zero and unit, definition of the function to be measured,
consistency, within person variability, comparability, distinctness 
of power and achievement, equal exposure and practice, advantages of 
indirect measurement, test construction, test use, and measurement 
must not be confused with pedagogy) are tabulated according to 
specific problems and proposed solutions to each. Nine pages of
references are provided. (Author/THJ) 
Abstract

The purpose of this essay is to describe the principles of
educational measurement proposed by Ben Wood during the 1920s in
his dissertation, which was written under the direction of E. L.
Thorndike and later published as Measurement in Higher Education
(1923). These principles were selected because they illustrate one
of the earliest and most complete descriptions of a set of basic
and perennial problems encountered in educational testing. The
specific questions addressed in this essay are as follows: What
were the basic measurement problems identified by Thorndike and
Wood in the first two decades of this century? How do these
measurement problems appear within the context of educational
testing according to Wood? How were these problems addressed by
Wood in the 1920s? And how are these problems viewed today?
THORNDIKE'S AND WOOD'S PRINCIPLES OF EDUCATIONAL MEASUREMENT:
TEST THEORY IN THE 1920s¹

The history of science is the history
of measurement. (Cattell, 1893, p. 316)

Whatever exists at all exists in some
amount. To know it thoroughly involves
knowing its quantity as well as its
quality. (Thorndike, 1918, p. 16)

In his presidential address to NCME last year, Jaeger (1987)
reminded the educational measurement community of the importance of
periodically reviewing the history of our discipline. He 
eloquently summed up his remarks as follows: 

I would assert that to move forward efficiently we 
must first look back — to incorporate and build upon the 
riches of the past while avoiding futile paths earlier 
explored and appropriately abandoned. To dwell on the past
is folly; to ignore it is absurdity. (p. 13)

This essay is intended to identify what I consider one source of
these "riches". Specifically, I would like to discuss a fairly
complete theory of educational testing proposed by Ben Wood in
1923 based on the measurement theory of E. L. Thorndike (1904;
1919). These principles were selected because they can be used to
illustrate some of the basic and perennial problems encountered in
educational measurement, and also provide a useful framework for
the exploration of its history.

What were the basic measurement problems identified by
Thorndike and Wood in the first two decades of this century? How
do these measurement problems appear within the context of
educational testing according to Wood? And how were these problems
addressed by Wood in the 1920s? This essay is intended to provide
answers to these questions. Some brief comments will also be made
on how these principles appear today.

Thorndike and Wood

In 1904, E. L. Thorndike published the first edition of his
highly influential book entitled An Introduction to the Theory of
Mental and Social Measurements. Thorndike's major aim in writing
the book was to

. . . introduce students to the theory of mental measurements
and to provide them with such knowledge and practice as may
assist them to follow critically quantitative evidence and
argument and to make their own researches exact and logical
(Thorndike, 1919, p. v)

Thorndike's book was the standard reference on statistics and
quantitative methods in the mental and social sciences for the
first two decades of this century (Clifford, 1984; Travers, 1983).
Much of this influence can be attributed to Thorndike's clear and
expository writing style. He explicitly acknowledges that the then
current work in measurement theory had not been presented in a
manner suitable for students without fairly advanced mathematical
skills, and he set out to present a less mathematical introduction
to measurement theory based on the belief that "there is, happily,
nothing in the general principles of modern statistical theory but
refined common sense, and little in the techniques resulting from
them that general intelligence can not readily master" (p. 2).
Many of us who have struggled with the mathematics of item
response theory can appreciate Thorndike's comments, and applaud
his attempt.

Although Thorndike wrote extensively on educational
measurement, covering topics which ranged from the general
statement of his theory (Thorndike, 1904; 1919) to the measurement
of a variety of educational outcomes (Thorndike, 1910, 1914, 1921),
as well as intelligence (Thorndike et al., 1926), I have found
that one of the clearest and most complete statements of
Thorndike's measurement theory was presented by his student and
colleague, Ben Wood. In a chapter titled "Some Principles of
Educational Measurement", Wood (1923) stated that

This chapter is little more than an effort to expand that
treatment [of measurement theory] for the purpose of
exposition. Practically all the material in this chapter
is taken from Professor Thorndike's well-known treatise, or
directly inferred from some of its propositions.
(Wood, 1923, p. 141)

Wood has provided a careful and useful exegesis of Thorndike's
early work on measurement and its implications for educational
testing. Wood's work provides the structure for discussing the
principles of educational measurement presented here.²

What were the basic measurement problems identified by
Thorndike and Wood? Thorndike clearly stated that the "special
difficulties" of measurement in the behavioral sciences are

(1) Absence or imperfection of units in which to measure;

(2) Lack of constancy in the facts measured;

(3) Extreme complexity of the measurements to be made.

In order to illustrate the problems related to the absence of an
accepted unit of measurement, Thorndike (1919) pointed out that
the spelling tests developed by Joseph Mayer Rice (Graham, 1956)
did not have equal units. Rice assumed that all of his spelling
words were of equal difficulty, while Thorndike argued that the
correct spelling of an easy versus a hard word did not reflect
equal amounts of spelling ability. Because the units of
measurement are unequal, Thorndike asserted that Rice's results
were inaccurate. Without general agreement on units, the meaning
of our test scores becomes more subjective.

Inconstancy is the second major measurement problem identified
by Thorndike (1919). Many of the measurement problems encountered
in the behavioral sciences are related to the random variation
inherent in many human characteristics. Not only are these
variations due to the unreliability of our tests, but they also
reflect within subject fluctuations. For example, if we measure a
person's motivation, or even body temperature repeatedly, these
values tend to vary.

The final measurement problem or "special difficulty"
identified by Thorndike pertains to the extreme complexity of the
variables and constructs that we wish to measure. Most of the
variables worth measuring in the behavioral sciences, such as
mathematics ability, intelligence, and competitiveness, do not readily
translate into unidimensional tests which permit the reporting of a
single score to represent the individual's location on the
variable.

Some Principles of Educational Measurement

In addressing the three "special difficulties" identified by
Thorndike within the context of education, Wood (1923) identified a
set of sixteen principles which included technical recommendations
on test construction, as well as more policy-oriented issues
related to test use in education. One of Wood's major concerns was
with how the new objective items could be used to solve a number of
measurement problems in higher education. A summary of Wood's
principles, problems and proposed solutions is given in Table 1.



Insert Table 1 about here 



I should also point out, as Wood did, that these principles were 
intended to be taken in concert as solutions to the three problems 
in measurement identified by Thorndike. In the following sections,
each of Wood's principles will be presented and discussed. 
Objectivity 

Both Thorndike and Wood considered objectivity to be one of
the most important characteristics of a valid test. According to
Thorndike (1919), "a perfectly objective scale is a scale in
respect to whose meaning all competent thinkers agree" (p. 141).
How can agreement on the meaning of the scores on a test be
obtained? Thorndike (1919) proposed the creation of a set of
standard items calibrated onto a scale which would be used as a
"common measuring stick", while Wood (1923) addressed this
measurement problem in terms of the objectivity of the scoring
method. To quote Wood (1923), "the True-False test is a good
example of an objective mental scale. No competent person would
disagree in rating a True-False paper, provided they used the key
which accompanies the test" (p. 144). Anticipating the idea that
reliability is necessary but not sufficient to establish the
validity of a test, Wood (1923) stated that "it is perfectly
possible to have a very objective scale without having one which
measures the facts to be measured" (p. 144).

From a current perspective, Wood (1923) clearly was dealing 
with a problem related to the reliability of the test scores, 
although the more general view of this principle based on Thorndike
(1919) suggests that Thorndike also included aspects associated
with validity. The meaning of test scores, and any consensus
about their meaning, would involve establishing both their
reliability and validity. Many current measurement textbooks use
the term "objectivity" of scoring much as Wood did (Anastasi, 1988; 
Cronbach, 1984). Further, the word "objectivity" is used in 
another way in the measurement theory of Georg Rasch (1977, 1980). 
According to Wright and Stone (1979), two conditions are necessary 
for objectivity as viewed by Rasch, and these are (1) the 
calibration of the measurement must be independent of those objects 
that happen to be used for the calibration and (2) the measurement 
of objects must be independent of the instrument that happens to be 
used for measuring. 
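
The first of these conditions can be illustrated with the dichotomous Rasch model, in which the log-odds of a correct response equals the difference between a person's ability and an item's difficulty. The Python sketch below is illustrative only: the function names and the numerical values are assumptions made for this example and are not taken from Rasch or from Wright and Stone. It shows that the comparison of two items comes out the same no matter which person is used for the calibration.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(probability):
    """Natural log of the odds of success."""
    return math.log(probability / (1.0 - probability))


def item_contrast(ability, difficulty_a, difficulty_b):
    """Log-odds difference between two items for a single person."""
    return (log_odds(rasch_probability(ability, difficulty_a))
            - log_odds(rasch_probability(ability, difficulty_b)))


# Hypothetical persons of low, middle and high ability (in logits).
abilities = [-2.0, 0.0, 1.5]

# Hypothetical item difficulties: item A = 0.5, item B = -1.0.
contrasts = [item_contrast(theta, 0.5, -1.0) for theta in abilities]
```

Each entry of `contrasts` is approximately -1.5, the difference between the two assumed item difficulties, whatever ability value is used. The same argument with persons and items exchanged gives the second condition.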
Reference to a defined zero point
in terms of a defined unit 

One of the key problems in educational measurement is the
establishment of scaling methods which provide meaningful and
interpretable test scores. The solution to this problem according
to Thorndike and Wood was based on the development of scales with
defined zero points, either arbitrary or absolute, and the
selection of a stable unit of measurement. The solution proposed
by Wood (1923) was based on a z-score transformation with the mean
defining an arbitrary zero point, and the standard deviation as the
unit. Wood selected the mean and standard deviation because of
their relative "stability". Wood (1923) also dealt with another
aspect of the scaling problem related to the comparability of test
scores which would be viewed today as an equating problem. In his
words,

the same test applied to different groups gives both different
points of origin and different Standard Deviations. Universal
comparisons can therefore be made only when measurements are
expressed in terms of the Standard Deviation (and reckoned
from the Mean), of some defined and standard distribution.
(Wood, 1923, p. 150)

Current approaches to the problem of scaling include a whole array
of methods for equating based on classical test theory and item
response theory (Brennan, 1987; Skaggs & Lissitz, 1986; Yen, 1986).
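
Wood's proposal, scores reckoned from the mean and expressed in standard deviation units of a defined reference distribution, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name, the choice of the population standard deviation, and the scores themselves are all hypothetical and do not come from Wood (1923).

```python
from statistics import mean, pstdev


def standard_scores(raw_scores, reference_scores):
    """Express raw scores as deviations from the reference mean,
    in units of the reference standard deviation."""
    reference_mean = mean(reference_scores)
    reference_sd = pstdev(reference_scores)  # population SD (an assumption)
    return [(score - reference_mean) / reference_sd for score in raw_scores]


# Hypothetical "defined and standard distribution": mean 50, SD 10.
reference_group = [40, 60, 40, 60]

# Hypothetical raw scores from some other group taking the same test.
new_scores = [55, 70]

scaled = standard_scores(new_scores, reference_group)
```

Because both the zero point and the unit come from the same reference group, scores from different groups land on one scale. Had each group been standardized against its own mean and standard deviation, the resulting values would not be comparable, which is the difficulty Wood describes.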
The Principle of Definition

In his third principle, Wood (1923) returned to a question
connected to the validity of the test scores. Validity refers to
the appropriateness, meaningfulness and usefulness of the
inferences which can be made from the test scores (Standards for
Educational and Psychological Testing, 1985). The basic question
is as follows: What is the test actually measuring? Wood (1923)
proposed that a precise operational definition of the construct be
used to answer this question, and that this definition would make
clear what the test measured. Wood's view here is close to the
modern idea of content validity, which is not too surprising given
his focus on educational achievement tests. Neither Thorndike nor
Wood included the broader validity issues implied by the question
raised in this section (what is the test measuring?), which
would include obtaining criterion-related and construct-related
evidence relevant to this question. Recent arguments have been
made for the importance of construct validity as well as content
validity for achievement tests (Haertel, 1985). Under the
principle of definition, Wood (1923) also anticipated problems
related to the development of operational definitions for complex
variables, such as reading achievement and intelligence.
Consistency

A recurring problem in educational and psychological
measurement relates to the complexity of the constructs that we
wish to measure. This "extreme complexity" is dealt with by Wood
(1923) in terms of the concept of "consistency", which might be
called unidimensionality today. In Wood's example, he points out
that a "notable example of obvious impurity of measurement is
afforded by some arithmetic tests . . . problems in these tests are
but little more than very severe reading tests . . . it would seem
more advantageous for all purposes of measurement to separate the
two functions" (pp. 154-155). Unfortunately, the dimensionality of
a set of test items can not be adequately assessed simply by
examining the content of the items. How do we really know whether,
when an individual responds to a set of test items, he or she is
really only using one ability or many? In many instances, useful
test scores are produced by summing what Thorndike and Wood might
view as "inconsistent items". The early Binet and Simon test was
criticized on this basis by Spearman (1927), who referred to their
intelligence test as a set of "hotchpot procedures" (p. 66). Wood
did not have adequate procedures for dealing with this problem, and
exciting current work in item factor analysis (Bock, Gibbons and
Muraki, in press; Mislevy, 1986; Muthen, 1984) has contributed to
addressing the problem of assessing the "consistency" or
dimensionality of our tests in education and psychology.
Within Person Variability 

In this fifth principle, Wood (1923) is concerned with the 
problem of "variability of mental functions in the same individual 
from day to day and hour to hour" (p. 155). Usually, we think of
individuals as having fairly stable behavioral characteristics.
These characteristics are not really fixed, but can be viewed as an
average over a number of observations. This intra-individual
variability may be due to a variety of factors, such as boredom,
anxiety, fatigue or illness, and must be taken into account in
measurement. If the intra-individual variability in responses is
great, then the problem of identifying differences between
individuals becomes more difficult. In order to address this
problem, Wood (1923) recommends administering as many items as
possible. In his words, "Only by taking a large sample of an
individual's performances can we arrive at a reliable estimate of
his normal or average ability" (p. 151). When a more complex
variable is measured, such as reading ability, then it is even more
important to increase the number of items. Wood (1923) referred to
this issue as the "principle of increasing accuracy" (p. 151). 
This principle would be viewed from a current perspective as 
dealing with the reliability of the scores and the standard error 
of measurement which provides an index of this response 
variability. It is well known, all else being equal, that we can
increase the reliability of test scores by increasing the number of
items because a better sample of the content domain can be
obtained. Generalizability theory (Cronbach et al., 1972)
provides an approach which can be used to examine various sources
of random variation which can be useful in addressing this 
measurement problem. 
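
The relation between test length and reliability can be made concrete with the Spearman-Brown formula from classical test theory, together with the usual expression for the standard error of measurement. Neither formula is quoted in the text above; the Python sketch below is a minimal illustration, and the function names and numbers are assumptions chosen for the example.

```python
import math


def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`
    using parallel items (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


def standard_error_of_measurement(score_sd, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)


# Hypothetical test: reliability 0.60, observed-score SD of 10 points.
original_reliability = 0.60
doubled_reliability = spearman_brown(original_reliability, 2.0)

sem_original = standard_error_of_measurement(10.0, original_reliability)
sem_doubled = standard_error_of_measurement(10.0, doubled_reliability)
```

Under these assumed values, doubling the test raises the projected reliability from 0.60 to about 0.75. For simplicity the score standard deviation is held at 10 points, as if scores were reported on a fixed standardized scale, so the standard error of measurement falls accordingly. This is one way to read Wood's "principle of increasing accuracy" in present-day terms.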
Comparability 

This principle deals with a problem related to test use. The
word "comparability" is used because Thorndike and Wood believed
that once a test had been calibrated, the application of this test
involved a comparison between the test and the person to be
measured. This idea can be visualized more clearly if we think of
the problem of measuring writing ability using a standard set of
essays. Once these essays have been calibrated from poor to
excellent, a judge "compares" each new essay to the set of
standards in order to define the level of writing ability reflected
in each essay. This measurement problem relates to the question of
whether or not the test can be validly applied with reasonable ease
and accuracy to the objects being measured. As an example, a
bathroom scale is not accurate enough to use in weighing gold.
Wood's proposed solution was to select an appropriate test to
measure the construct of interest. In grading an essay, the topics
addressed in the calibrated essays should be the same as those the
examinees are writing about.
Power and Achievement

Wood stressed that the distinction between "power" 
(intelligence) and "achievement" must be kept in mind in the 
construction, administration, and interpretation of all test 
results. In order to illustrate this principle, he uses as an
example the problem of placing two students with very different
backgrounds, one from a rural setting and the other from an urban
setting, in reading ability groups. The reading achievement score
of the urban child was higher than the rural child's score, and the
teacher planned to place the rural child in the lowest reading
group. Additional information was available on the Terman
intelligence test which indicated that the rural child had an I.Q.
of 130 and, at the urging of Wood, she was placed in a higher
reading group. The subsequent reading achievement was quite high.
The major point here seems to be that these two types of test can
provide different information about individual differences, and
that this information can be useful in educational planning and
decision making. Recent views of intelligence testing suggest
that the distinction between intelligence and achievement as
measured by current IQ tests may not be as clear as previously
believed (Anastasi, 1983). Further, many intelligence tests used
today in schools have been renamed as tests of "school ability",
"scholastic ability" and "academic aptitude" (Beck, 1986), which
indicates that these types of tests, whatever they are called, can
serve important functions for education similar to those
envisioned by Wood.

Principle of Equal Exposure and Practice 

How can differences in opportunity to learn be addressed when
testing general intelligence? In answering this question, Wood
(1923) stated that "inferences as to the general intelligence or
inborn ability of two individuals must be based upon their
reactions to material to which they have been equally exposed and
in which they have had equal practice, except insofar as exposure
and practice are influenced by native capacity" (p. 158). In order
to minimize the effects of opportunity to learn, Wood recommended
that "emphasis should be placed upon testing mental processes which
are largely independent of informational content" (p. 160), while
recognizing that differences in past exposure can never be
completely eliminated. In situations where there are large
differences in the home or social environment, these must be
considered in explaining differences in achievement, general
intellect and special abilities. Wood's views are fairly modern,
although he does seem to be a bit optimistic about the possibility
of controlling for these differences in opportunity to learn. The
problem of exposure and practice is still an important issue
because it can have a significant impact on the way in which test
scores are interpreted. If two children are tested on a common set
of educational objectives, and one has not had the opportunity to
learn the objectives, is it fair to compare the children on this
test? Do the scores have the same meaning? This problem is reflected
in current issues related to customized testing (Yen, Green & Burket,
1987), and curricular validity (Mehrens & Phillips, 1987). There
seems to be general agreement that if opportunity is an important
factor, then it must be taken into account in the interpretation of
the test scores; however, the methods for doing this are still the
subject of debate.
Advantages of Indirect Measurement

This principle treats two related problems: What are the
advantages of objective items, which Wood (1923) called "indirect
measurement", as compared to essay items? Or more broadly
conceived, what is the best type of item to measure a construct?
This principle is concerned with the disadvantages of essays as an
item type, and Wood's advocacy of "new-type or objective"
examinations in education. According to Wood (1923), "the essay
examination in the hands of the average teacher does measure a very
important element which apparently cannot be measured directly by
any other means thus far developed. But it measures that element
very incompletely and very unreliably" (p. 161). Wood (1923)
identified two major weaknesses in essay exams: (1) inadequate
sampling of examinee performance and of material and (2)
variability and uncertainty in the subjective methods used to score
essays. He presents a strong case for the use of objective item
types, such as completion, true-false and recognition items, in
higher education, and recommends that "where indirect methods have
demonstrable advantages over direct measurements, indirect
measurements should be used" (p. 151). In spite of his arguments
against essay type items, he still felt that essay items played an
important role. In his words, "indirect measurement is not
suggested as a substitute for, but as a supplement to, direct
measurement" (p. 161). From the perspective of the 1980s, both
methods would be viewed as "indirect" as opposed to
"performance-type" tests (Anastasi, 1988). There is little if any
debate over the usefulness of "indirect measures", such as multiple
choice items, today. The debate today centers on when a particular
item type is appropriate. Although Wood (1923) was discussing the use
of essays to assess achievement in the content areas, a similar set
of concerns appears today in the use of essays to measure writing
ability. Essay type items and the assessment of writing ability
are being increasingly used in state and national assessment
programs as well as a part of standardized achievement tests
(Quellmalz, 1986; Special Issue on Writing Assessment, 1984),
although scoring issues remain (Chase, 1986).
Test Construction

Wood's next three principles are related to test construction, and
the steps in test construction that can increase the validity of
the test scores. According to Wood, (1) "a valid test must contain
a larger number of small elements" (p. 163), (2) "in the
measurement of any mental function as many types of questions
should be employed as administrative conditions allow" (p. 165) and
(3) "the questions should involve as little as possible irrelevant
considerations and superfluous activities on the part of the
examinee" (p. 168). The principle of "many small elements" in (1)
above reflects Wood's case against the use of essay items. Most
educational tests created today do not follow the recommendation
regarding the use of multiple item types as suggested in (2) above.
Gulliksen (1986) has argued that "the failure to distinguish
between the requirements of standardized testing and classroom
testing seems to be responsible for the lack of improvement, and
perhaps even a decline, in the quality of teacher-made classroom
tests over the past 40 years" (p. 189). Gulliksen (1986) goes on
to call for the use of a variety of item types by teachers. Wood
(1923) recommended seven conditions for constructing a "good" item
which are commonly recommended today within standard texts on
educational testing. For example, Wood suggested that the items
should not contain "trick" elements and chance influences should be
minimized. A current topic in test construction not directly
addressed by Thorndike and Wood is how to detect item bias (Linn &
Drasgow, 1987; Shepard, Camilli & Williams, 1985). Given the
social context of their work, this was simply not a measurement
issue at the time. Cronbach (1975) and Haney (1981) provide
interesting and useful discussions of the interplay between social
concern, policy and testing.
Test Use 

The next three principles refer generally to test use, and the
match between persons and items in terms of appropriateness. Wood
was concerned with the adequacy of a test in terms of measuring the
whole range of a construct for a particular group. The problem
would be evident if the test was too easy or too hard for the
examinees, and the test scores would not be distributed on the
variable. In other words, the test would not be able to detect
individual differences, the sine qua non of measurement. If the
test is "appropriate" then "it must be sensitive to and capable of
registering real differences in every part of the range of the
quality it is designed to measure" (p. 171). Further, Wood (1923)
points out that "no absolute criterion is available to show whether
an exam fully satisfies this condition, but fairly secure indices
are not wanting" (p. 170). Wood also had a parallel concern with
the distribution of the items on the scale. The measurement goal
is to develop a set of items that are in the appropriate range of
difficulty for a group of examinees. When an individual encounters
an "inappropriate" item, guessing and other chance influences can
interfere with the measurement process. Wood pointed out that
"chance influences must be recognized and countered in the
construction, scoring, and evaluation of every type of question"
(Wood, 1923, p. 172). The concerns expressed by Wood could not be
handled adequately with the methods available in the 1920s. From
a current perspective, Wood's concerns here could be examined with
a "map of the variable" (Wright and Stone, 1979), which provides a
graphic display that shows simultaneously the location and
distribution of items and individuals on the variable. Further,
recent work on computerized adaptive testing (Green et al., 1984;
Weiss, 1982) is explicitly motivated by this concern with the match
between items and individuals in terms of appropriate item
difficulty. Additional work on appropriateness indices provides
another approach which can be used to examine the validity of
individual test scores (Drasgow, Levine, & Williams, in press).
Measurement and Pedagogy 

Wood (1923) believed that in the construction and 
administration of examinations, measurement must not be confused 
with pedagogy. Wood is again defending the use of objective items,
which had apparently been criticized as having no pedagogical value
as compared to essay items. His major point is that although both
types of items can have pedagogical value, the value of an
examination must be assessed separately in regard to these two
issues. According to Wood (1923), "intrinsic pedagogical value in
an examination is highly desirable, but the value of the
examination as a measuring device cannot be made to depend on its
value as a teaching device" (p. 174). Today many uses of tests
involve the explicit development of a link between testing and
instruction (Airasian & Madaus, 1983; Burstein, 1983; Glaser, 1986;
Stiggins, Conklin & Bridgeford, 1986).

Discussion and Implications

In many ways, we have made a great deal of progress in
psychometrics that Thorndike and Wood could not have anticipated.
Recent advances in measurement theory (item response theory,
generalizability theory and factor analysis), computer technology
(computerized adaptive testing, video discs), and statistical
methodology (probabilistic models for analyzing qualitative data
and Bayesian methods) make possible solutions to many of our
measurement problems which were undreamed of in the 1920s. And
yet considering the basic measurement problems identified by
Thorndike and the principles of educational measurement proposed by
Wood, it is hard not to be impressed, and it is easy at times to
forget, that many of these ideas were first expressed by Thorndike
almost 85 years ago and by Wood over 65 years ago.

The "special difficulties" of measurement in the behavioral 
sciences are still present today. Generally agreed upon units are
not available for many variables of interest, human characteristics
still show random variation, and there is little doubt that the
variables which we wish to measure are still complex. What seems
to have changed the most is not the basic questions or problems of
measurement, but our ingenuity and technical finesse in finding new
solutions for old problems. In some cases, though, early
solutions used by Thorndike, such as item scaling, worked
remarkably well (Engelhard, 1984). Classical test theory was still
in its infancy when Thorndike and Wood conducted their research and
proposed their measurement theories, and modern measurement
theories, such as item response theory and generalizability theory,
had of course not yet been developed. Many of the basic problems in
measurement were identified at the beginning of this century, while
the solutions offered have changed over time as new measurement
theories are created.

It is hoped that this essay will generate some additional 
interest in the history of educational test theory. For example, 
Haney and Reidy (1987) report finding only seven references that 
deal directly with the history of educational testing in America in 
the entire ERIC data base, and I have had a similar experience with 
PsychLIT, which is based on Psychological Abstracts. Early books on 
this topic, such as Linden and Linden (1968) and DuBois (1970), are 
now about 20 years old. Several historical articles on mental 
testing (Clarke & Clarke, 1985), educational testing (Resnick, 
1982), educational assessment (McArthur, 1987), educational 
evaluation (Madaus, Stufflebeam, & Scriven, 1983), and employment 
testing (Hale, 1982) are available, but no book-length treatment 
has been published recently. Sokal (1987) has edited a volume on 
psychological testing and American society, and has made some 
concrete suggestions about approaches to the history of 
psychological testing (Sokal, 1984). In her recent review of two 
new books on the history of statistics (Porter, 1986; Stigler, 
1986), Cowan (1987) has made an important distinction between 
histories of a discipline written by insiders versus outsiders. 
There is a clear need for both versions, but what is required is an 
updated history of the key ideas underlying measurement theory 
which does for psychometrics what Stigler (1986) as an "insider" 
has done for statistics. Since test theory is approaching its 
century mark, if we consider the Cattell article in 1893 as its 
birth, it would seem that a comprehensive history is somewhat 
overdue. I am currently working on a project, with the generous 
support of a Spencer Fellowship from the National Academy of 
Education, which focuses on the comparative and historical 
development of several measurement theories and which I hope will 
contribute to this history of test theory. 

In conclusion, I hope that this essay illustrates some of the 
insights that can be gained from a careful analysis of earlier 
work on educational testing. In presenting the measurement 
problems, I have not intended to provide an exhaustive discussion 
of how these problems would be addressed within the context of the 
major modern measurement theories, such as item response theory 
(Lord, 1980; Wright & Masters, 1982), generalizability theory 
(Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Brennan, 1983), or 
factor analysis (Joreskog & Sorbom, 1986). Many of the problems 
identified by Thorndike (1919) and Wood (1923) could be the basis 
of articles in themselves, and my goal has been to provide a 
general overview, rather than a great deal of depth. 

Jaeger (1987), in his presidential address, posed the following 
question: Where's the revolution!? One partial answer is that we 
have not had a revolution, but maybe some "evolution," in terms of 
the measurement problems we seek to solve. Another answer might be 
that in some areas, our new theories of measurement and the 
technological advances which deal with these problems are indeed 
revolutionary when viewed from the perspective of the 1920s! 
Reference Notes 



1. Support for this research was provided through a Spencer 
Fellowship from the National Academy of Education. An earlier 
version of this paper was presented at the annual meeting of the 
American Educational Research Association in New Orleans (April, 
1988). 

2. Although most of us are fairly familiar with E. L. Thorndike 
and his life, Ben Wood may not be as well known. Wood was 
involved, along with William S. Learned, in the Pennsylvania Study 
which was supported by the Carnegie Foundation from 1928-1932 
(Resnick, 1982). One of the major outcomes of this study 
was to encourage high schools and colleges to keep cumulative 
records of their students. Wood also played a major role in the 
development of the Cooperative Test Service in 1930, as well as in 
the early development of the National Teacher Examination (Downey, 
1965). 
Table 1 

Summary of the Principles of Educational Measurement 

Principle           Problem                     Proposed Solution

Objectivity         How can agreement on        Development of a common
                    the meaning of a test       measuring stick; reduce
                    be increased?               variation due to scoring
                                                method

Defined zero        How can the location        Use Mean for location and
and unit            and unit of measurement     SD for unit because of
                    be adequately defined?      their relative stability

Definition of       What is the test            Use clear operational
function to         measuring?                  definition of construct
be measured

Consistency         Is the test                 Minimize obvious
                    unidimensional?             impurities

Within person       How can response errors     Increase number of
variability         due to intra-individual     items
                    variability be minimized?

Comparability       Can the test be validly     Select an appropriate
                    applied with reasonable     test to measure
                    ease and accuracy to the    the construct
                    objects to be measured?

Power and           What is the difference      Difference must be kept
achievement         between intelligence        in mind for interpretation
are distinct        and achievement?            and use of test results
                    Why is it important?

Equal exposure      How can differences in      Tests should be free of
and practice        opportunity to learn        informational content;
                    be addressed when testing   Take into account if
                    general intelligence?       control is not possible

Advantages of       What are the advantages     Use objective
indirect            of objective items as       items to increase
measurement         compared to essay items?    reliability

Test                What are the steps in       Increase number of
construction*       test construction that      items; use multiple
                    can increase validity?      item types; construct
                                                "good" items

Test use*           What are the steps in       Appropriateness of the
                    test use that can           match between persons
                    increase validity?          and items; reduce
                                                chance influences

Measurement         What is the distinction     Treat testing and
must not be         between measurement and     educating separately
confused with       pedagogy? Why is it
pedagogy            important?

* Note. Six principles treated separately by Wood have been 
grouped under test construction and test use. 